1 Extended description of experiment
The Christmas 2018 statistical cognition experiment ran continuously from 16 December 2018 to 1 January 2019. Participants were recruited via social media, particularly Twitter and Facebook. The experiment was written in HTML, CSS, and JavaScript and deployed through Qualtrics. Participants were asked to perform a series of fictitious experiments with a two-group design and to decide which of the two groups was “faster”. The fictitious results were reported in an unusual way: no numbers were given. This let us test participants’ ability to use significance-testing logic.
Of particular interest to us were:
- whether participants sought information relevant to a significance test, and ignored irrelevant information,
- whether participants could come to the right conclusion with high probability,
- whether participants’ conclusions were reasonable given the information they were given, and
- whether participants’ descriptions of their strategies were consistent with significance testing logic.
1.1 Basic task setup
Table 1.1: Hidden effect sizes (δ) and the probability of assignment to each.
| Toy name | Hidden effect size (δ) | Assignment probability |
|---|---|---|
| whizbang balls | 0.00 | 0.25 |
| constructo bricks | 0.10 | 0.11 |
| rainbow clickers | 0.19 | 0.11 |
| doodle noodles | 0.30 | 0.11 |
| singing bling rings | 0.43 | 0.11 |
| brahma buddies | 0.60 | 0.11 |
| magic colorclay | 0.79 | 0.11 |
| moon-candy makers | 1.00 | 0.11 |
Participants were randomly assigned, with equal probability, to one of two evidence powers: wide (\(q=3\)) or narrow (\(q=7\)). Participants were also randomly assigned to one of eight “true” effect sizes. We were particularly interested in how participants behave when there is no true effect, so the probability of assignment to no effect (\(\delta=0\)) was set to 25%. The remaining 75% was divided evenly among the seven nonzero effect sizes listed in Table 1.1. This gives about 10.7% each, shown rounded as 0.11.
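The assignment scheme above can be sketched as follows. This is a minimal illustration, not the experiment's actual Qualtrics/JavaScript code. The names `EFFECT_SIZES`, `EFFECT_WEIGHTS`, `EVIDENCE_POWERS`, and `assign_condition` are hypothetical.

The sketch draws only the magnitude of the effect. The caption of Figure 1.3 implies that negative effect sizes also occurred. However, this section does not say how the direction was assigned, so the sketch leaves it out.

```python
import random

# Hidden effect sizes from Table 1.1. delta = 0 has probability 0.25; the
# seven nonzero sizes share the remaining 0.75 equally (about 0.107 each).
EFFECT_SIZES = [0.00, 0.10, 0.19, 0.30, 0.43, 0.60, 0.79, 1.00]
EFFECT_WEIGHTS = [0.25] + [0.75 / 7] * 7

# Evidence powers: wide (q = 3) or narrow (q = 7), each with probability 0.5.
EVIDENCE_POWERS = [3, 7]


def assign_condition(rng: random.Random) -> tuple[float, int]:
    """Draw a hypothetical (delta, q) pair for one participant.

    Only the magnitude of delta is drawn; the direction of the effect is
    not modelled here (assumption, see the lead-in).
    """
    delta = rng.choices(EFFECT_SIZES, weights=EFFECT_WEIGHTS, k=1)[0]
    q = rng.choice(EVIDENCE_POWERS)
    return delta, q


if __name__ == "__main__":
    rng = random.Random(2018)
    draws = [assign_condition(rng) for _ in range(100_000)]
    share_null = sum(d == 0.0 for d, _ in draws) / len(draws)
    print(f"share assigned delta = 0: {share_null:.3f}")
```

Weighted sampling with `random.choices` keeps the unequal 25% / 75% split explicit in one place. The sampled share of null assignments should therefore hover near 0.25.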
The cover story (which can be read here) presented a problem in which it was desired to know which of two groups of elves (“Sparklies” or “Jinglies”) was faster. Participants were presented with the results of fictitious experiments as they requested them.
Participants could also increase or decrease the sample size of the experiments. Importantly, they were not told the actual sample size. The sample size was adjusted with a slider that had 20 positions, corresponding to the 20 hidden sample sizes shown in Table 1.2.
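Table 1.2: Hidden per-group sample sizes for each of the 20 slider positions.
| n index (slider position) | n (per group) | Time (s) |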
|---|---|---|
| 1 | 10 | 1 |
| 2 | 12 | 2 |
| 3 | 14 | 2 |
| 4 | 16 | 2 |
| 5 | 19 | 2 |
| 6 | 22 | 3 |
| 7 | 26 | 3 |
| 8 | 30 | 3 |
| 9 | 35 | 4 |
| 10 | 41 | 5 |
| 11 | 48 | 5 |
| 12 | 57 | 6 |
| 13 | 66 | 7 |
| 14 | 78 | 8 |
| 15 | 91 | 10 |
| 16 | 106 | 11 |
| 17 | 125 | 13 |
| 18 | 146 | 15 |
| 19 | 171 | 18 |
| 20 | 200 | 20 |
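The sample sizes in Table 1.2 are consistent with 20 geometrically spaced values between 10 and 200, rounded to the nearest integer. The Time (s) column equals \(n/10\) rounded up. This section does not state how either column was generated or what Time measures. The sketch below is therefore only a reconstruction from the table, with hypothetical function names, that happens to reproduce both columns.

```python
def hidden_sample_sizes(n_min: int = 10, n_max: int = 200, steps: int = 20) -> list[int]:
    """Return `steps` geometrically spaced sizes from n_min to n_max, rounded."""
    ratio = n_max / n_min
    # Position i (0-based) gets n_min * ratio^(i / (steps - 1)).
    return [round(n_min * ratio ** (i / (steps - 1))) for i in range(steps)]


def time_column(sizes: list[int]) -> list[int]:
    """Return n / 10 rounded up (integer ceiling), matching the Time (s) column."""
    return [(n + 9) // 10 for n in sizes]


if __name__ == "__main__":
    sizes = hidden_sample_sizes()
    for position, (n, t) in enumerate(zip(sizes, time_column(sizes)), start=1):
        print(position, n, t)
```

Geometric spacing means each slider step multiplies \(n\) by roughly the same factor (\(20^{1/19} \approx 1.17\)). Fine resolution is concentrated at small sample sizes.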
The results were returned to participants visually. Instead of a test statistic, each result was shown as a color intensity (from white to red) at a horizontal location, which we describe as running from left (\(-1\)) to right (\(1\)). Results on the far left were red and indicated the strongest evidence that the Sparklies were faster. Results in the center were white and were evidence for neither group. Results on the far right were again red and indicated evidence that the Jinglies were faster. The intensity of the red was set by a CSS linear gradient on the transparency (alpha), which rose linearly from 0 at the center to 1 at either end (\(x=-1\) or \(x=1\)).¹ The resulting horizontal axis is shown in Figure 1.1.
Figure 1.1: The interface on which fictitious results were presented to the participants.
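The mapping from a location to its displayed color can be sketched as a function of a single point \(x\). This does not reproduce the experiment's actual CSS gradient. It assumes pure red, `rgb(255, 0, 0)`, with alpha equal to \(|x|\), so that the center is transparent (white on a white page). The function name and the `"neither"` label for \(x = 0\) are hypothetical.

```python
def evidence_display(x: float) -> tuple[str, str]:
    """Map an evidence location x in [-1, 1] to (favoured group, CSS colour).

    Assumes pure red whose alpha equals |x|: fully transparent at the
    centre and fully opaque at either end, as described in the text.
    """
    if not -1.0 <= x <= 1.0:
        raise ValueError("evidence location must lie in [-1, 1]")
    if x < 0:
        group = "Sparklies"  # left half: evidence that the Sparklies are faster
    elif x > 0:
        group = "Jinglies"   # right half: evidence that the Jinglies are faster
    else:
        group = "neither"    # exact centre: no evidence either way
    return group, f"rgba(255, 0, 0, {abs(x):.3f})"


if __name__ == "__main__":
    for x in (-0.9, -0.2, 0.0, 0.5):
        print(x, evidence_display(x))
```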
1.2 Underlying statistical details
Each fictitious result was obtained by applying a transformation to a randomly sampled \(Z\) statistic. The distribution of \(Z\) depended on two quantities. The first was the effect size \(\delta\), randomly assigned and unknown to the participant. The second was the per-group sample size \(n\), adjustable by, but unknown to, the participant:
\[ Z \sim \mbox{Normal}(\delta\sqrt{n/2}, 1) \]
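The \(x\) location of the result, and hence its color, was then given by the transformation \[ x = \mbox{sgn}(Z)\left[1 - \left(1 - F_{\chi_1^2}\left(Z^2\right)\right)^{\frac{1}{q}}\right] \] where \(q\in\{3,7\}\) is the evidence power, \(x \in (-1,1)\) is the location, and \(F_{\chi_1^2}\) is the cumulative distribution function of the \(\chi_1^2\) distribution. Because \(Z^2 \sim \chi_1^2\) when \(\delta = 0\), the quantity \(p = 1 - F_{\chi_1^2}(Z^2)\) is the two-sided \(p\) value of \(Z\). The transformation can therefore be written equivalently as \(x = \mbox{sgn}(Z)\left(1 - p^{1/q}\right)\).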
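The following is a minimal sketch of how a single fictitious result could be generated from these definitions, using only the Python standard library. It relies on the identity \(1 - F_{\chi_1^2}(Z^2) = \mbox{erfc}(|Z|/\sqrt{2})\) for the two-sided \(p\) value. The function names are hypothetical, and this is not the code used in the experiment.

```python
import math
import random


def evidence_location(z: float, q: int) -> float:
    """x = sgn(Z) * [1 - (1 - F_chi2_1(Z^2))^(1/q)].

    1 - F_chi2_1(Z^2) is the two-sided p value of Z, computed here as
    erfc(|Z| / sqrt(2)). For very large |Z| (about 38 or more) erfc
    underflows to 0 and x saturates at exactly +/-1.
    """
    sign = (z > 0) - (z < 0)  # sgn(Z) as -1, 0 or 1
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return sign * (1.0 - p_two_sided ** (1.0 / q))


def simulate_result(delta: float, n: int, q: int, rng: random.Random) -> float:
    """Draw Z ~ Normal(delta * sqrt(n / 2), 1) and return its x location."""
    z = rng.gauss(delta * math.sqrt(n / 2.0), 1.0)
    return evidence_location(z, q)


if __name__ == "__main__":
    rng = random.Random(1)
    # Ten results at the smallest sample size (n = 10), delta = 0.3, narrow evidence (q = 7).
    print([round(simulate_result(0.3, 10, 7, rng), 2) for _ in range(10)])
```

As a check, \(Z \approx 1.96\) (two-sided \(p = 0.05\)) maps to \(x \approx 0.632\) when \(q = 3\) and \(x \approx 0.348\) when \(q = 7\). The same statistic lands much closer to the center under the narrow distribution.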
The figure below shows the transformation from \(Z\) statistics (top axis) and one-sided \(p\) values (bottom axis) to \(x\) locations.
Figure 1.2: Transformation between traditional test statistics (\(Z\), \(p\)) and the \(x\) location
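The bottom axis of Figure 1.2 uses one-sided \(p\) values. For \(Z \ge 0\), \(1 - F_{\chi_1^2}(Z^2) = 2\left[1 - \Phi(Z)\right] = 2p_1\), where \(\Phi\) is the standard normal CDF and \(p_1\) is the one-sided \(p\) value. Substituting gives the form below, followed by a worked example. Negative \(Z\) values are the mirror image, \(x(-Z) = -x(Z)\).

```latex
\begin{aligned}
x &= 1 - \left(2p_1\right)^{1/q}, \qquad p_1 = 1 - \Phi(Z), \qquad Z \ge 0, \\
Z = 1.645 \;\Rightarrow\; 2p_1 \approx 0.10: \quad
x &\approx 1 - 0.10^{1/3} \approx 0.536 \quad (q = 3), \qquad
x \approx 1 - 0.10^{1/7} \approx 0.280 \quad (q = 7).
\end{aligned}
```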
Figure 1.3: Evidence distributions as functions of underlying effect size, for non-negative effect sizes. Distributions from left to right correspond to increasingly large true effect sizes. Evidence distributions for negative effect sizes were mirror images of these distributions. A: Smallest sample size, wide evidence distribution; B: Largest sample size, wide evidence distribution; C: Smallest sample size, narrow evidence distribution; D: Largest sample size, narrow evidence distribution
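Because \(x\) is a strictly increasing function of \(Z\), the evidence distributions in Figure 1.3 can also be computed exactly rather than by simulation. Inverting the transformation gives \(|z(x)| = \Phi^{-1}\left(1 - (1-|x|)^q/2\right)\), taking the sign of \(x\). It follows that \(P(X \le x) = \Phi\left(z(x) - \delta\sqrt{n/2}\right)\). The sketch below is our own derivation from the formulas above, with hypothetical function names. It is not necessarily how the figure was produced, and it uses `statistics.NormalDist` (Python 3.8+).

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def z_from_location(x: float, q: int) -> float:
    """Invert x = sgn(Z)[1 - p^(1/q)]: p = (1 - |x|)^q, |Z| = -Phi^-1(p / 2).

    -Phi^-1(p / 2) equals Phi^-1(1 - p / 2) but stays accurate when p is tiny.
    If p underflows to 0 (x extremely close to +/-1), inv_cdf raises an error.
    """
    p_two_sided = (1.0 - abs(x)) ** q
    magnitude = -STD_NORMAL.inv_cdf(p_two_sided / 2.0)
    return math.copysign(magnitude, x)


def evidence_cdf(x: float, delta: float, n: int, q: int) -> float:
    """P(X <= x) when Z ~ Normal(delta * sqrt(n / 2), 1); x must lie in (-1, 1)."""
    if not -1.0 < x < 1.0:
        raise ValueError("x must lie strictly between -1 and 1")
    mean_z = delta * math.sqrt(n / 2.0)
    return STD_NORMAL.cdf(z_from_location(x, q) - mean_z)


if __name__ == "__main__":
    # Cumulative evidence at x = 0.5, smallest vs largest n, narrow evidence (q = 7).
    for n in (10, 200):
        print(n, [round(evidence_cdf(0.5, d, n, 7), 3) for d in (0.0, 0.3, 1.0)])
```

Evaluating `evidence_cdf` over a grid of \(x\) values gives cumulative versions of the distributions plotted in panels A to D. Use \(n = 10\) or \(n = 200\) to match the panel's sample size, and \(q = 3\) or \(q = 7\) to match its width, with each \(\delta\) from Table 1.1.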
Figure 1.4: Selected two-sided \(p\) values for the null distribution of the evidence for the narrow evidence distribution (\(q=7\))
Figure 1.5: Selected two-sided \(p\) values for the null distribution of the evidence for the wide evidence distribution (\(q=3\))
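Under the null (\(\delta = 0\)), \(Z\) is standard normal, so its two-sided \(p\) value is uniform on \((0,1)\). Since \(|x| = 1 - p^{1/q}\), it follows that \(P(|X| \ge c) = P\left(p \le (1-c)^q\right) = (1-c)^q\). The two-sided \(p\) value of a location \(x\) is therefore \((1-|x|)^q\), and the critical location at level \(\alpha\) is \(1 - \alpha^{1/q}\). This closed form can be used to recompute two-sided \(p\) values like those selected in Figures 1.4 and 1.5. A minimal sketch of both functions follows; the function names are hypothetical.

```python
def null_p_value(x: float, q: int) -> float:
    """Two-sided p value of evidence location x under delta = 0: (1 - |x|)^q."""
    return (1.0 - abs(x)) ** q


def critical_location(alpha: float, q: int) -> float:
    """Smallest |x| whose two-sided null p value is <= alpha: 1 - alpha^(1/q)."""
    return 1.0 - alpha ** (1.0 / q)


if __name__ == "__main__":
    for q, label in [(3, "wide"), (7, "narrow")]:
        cuts = ", ".join(
            f"alpha={a}: |x| >= {critical_location(a, q):.3f}" for a in (0.05, 0.01, 0.001)
        )
        print(f"q={q} ({label}): {cuts}")
```

For example, under the wide distribution (\(q = 3\)), a result must lie beyond \(|x| \approx 0.632\) to be significant at the two-sided 0.05 level. Under the narrow distribution (\(q = 7\)), the threshold is only \(|x| \approx 0.348\).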